# Efficient Hardware Implementation of MQ Asymmetric Cipher PMI+ on FPGAs

Shaohua Tang\*, Bo Lv\*\*, Guomin Chen\*\*, and Zhiniang Peng

School of Computer Science & Engineering,
South China University of Technology, Guangzhou, China
shtang@IEEE.org,
csshtang@scut.edu.cn

**Abstract.** PMI+ is a Multivariate Quadratic (MQ) public key algorithm used for encryption and decryption operations, and belongs to post quantum cryptography. We designs a hardware on FPGAs to efficiently implement PMI+ in this paper. Our main contributions are that, firstly, a hardware architecture of encryption and decryption of PMI+ is developed, and description of corresponding hardware algorithm is proposed; secondly, basic arithmetic units are implemented with higher efficiency that multiplication, squaring, vector dot product and power operation are implemented in full parallel; and thirdly, an optimized implementation for core module, including optimized large power operation, is achieved. The encryption and decryption hardware of PMI+ is efficiently realized on FPGA by the above optimization and improvement. It is verified by experiments that the designed hardware can complete an encryption operation within 497 clock cycles, and the clock frequency can be up to 145.6MHz, and the designed hardware can complete a decryption operation within 438 clock cycles wherein the clock frequency can be up to 37.04MHz.

**Keywords:** Multivariate Quadratic (MQ) Public Key Algorithm, PMI+ Encryption and Decryption, Hardware Implementation, FPGA, Optimized Large Power Operation.

#### 1 Introduction

Public key cryptography has played an important role in modern communication and computer networks. The public key cryptography, which is used widely, mainly includes RSA based on integer factorization problem, ElGamal based on discrete logarithm problem and elliptic curve cryptography, etc. In order to adapt various occasions, many efficiently hardware implementations are proposed by researchers [22,18,25,19,23,14,8].

The quantum algorithm of P.Shor is able to solve the integer factorization and discrete logarithm problem in polynomial time, including a calculation problem in elliptic curve field, which directly threatens classical cryptosystems based on

<sup>\*</sup> Corresponding Author.

<sup>\*\*</sup> The authors Bo Lv and Guomin Chen contribute equally to this paper.

X. Huang and J. Zhou (Eds.): ISPEC 2014, LNCS 8434, pp. 187-201, 2014.

<sup>©</sup> Springer International Publishing Switzerland 2014

hard problems of number theory, and which helps to drive the development of post quantum cryptography. The post quantum cryptography can be divided into four categories: signature schemes based on hash function[17], lattice-based public key cryptosystem[13], public key cryptosystem based on error correcting code[16] and multivariate public key cryptosystem[7]. The research for post quantum cryptography is growing rapidly and many hardware and embedded system implementations of the post quantum cryptography appear in order to adapt various occasions[24,21,11,20,1,2,3,6].

PMI+ [5] is one kind of multivariate public key cryptosystem, and is a variant of MI[15]. Ding enhanced the security of MI by adding internal perturbation to the central map of MI in 2004, to produce a new variant of the MI cryptosystem which is called PMI cryptosystem[4]. However, the PMI cryptosystem has been broken by differential cryptanalysis by Fouque et al.[10] in 2005. Ding introduced new external perturbation to the central mapping of MI [5] in 2006, to produce PMI+ cryptosystem whose security has been greatly improved. Up to present, the PMI+ cryptosystem is still secure, and its hardware implementation is relatively less, so a hardware used to implement PMI+ is designed in this paper, which can be efficiently implemented in FPGA.

**Our Contributions.** The paper designs a hardware used to implement PMI+, which can be efficiently implemented on FPGA.

Firstly, a hardware architecture of encryption and decryption of PMI+ is developed, and description of corresponding hardware algorithm is proposed.

Secondly, basic arithmetic units are implemented with higher efficiency that multiplication, squaring, vector dot product and power operation are implemented in full parallel, wherein compared with a full parallel multiplier, a full parallel squarer takes up about one-twentieth of the logical unit and has shorter latency.

Thirdly, we implement an optimized large power operation, and compared with general power operation, it can reduce 4288 cycles at most in one process of decryption, with an obvious optimization. The encryption and decryption hardware of PMI+ is efficiently realized on FPGA by the above optimization and improvement.

Our experiments verify that if parameters are selected as  $(n, q, \theta, r, a) = (84, 2, 4, 6, 14)$ , the length of a plaintext block is 84 bits and the length of a ciphertext block is 98 bits, our designed hardware can complete an encryption operation within 497 clock cycles or 3.42us, wherein the clock frequency can be up to 145.6 MHz, and our designed hardware can complete an decryption operation within 438 clock cycles or 11.83us, wherein the clock frequency can be up to 37.04 MHz.

**Organization.** The structure of the rest of this paper goes as follows. Section 2 briefly introduces solution and theory of PMI+ encryption scheme, including the construction of algorithms, principles of encryption and decryption and the choice of parameters; Section 3 primarily focuses on hardware design and implementation of PMI+, including hardware structure design, algorithm description and implementation of basic arithmetic unit and hardware core module; Section 4

lists detailed experimental data, and makes performance contrast with other public key encryption schemes; and Section 5 is the conclusion of this paper, which summarizes the findings of this paper and proposes further research directions.

## 2 Preliminaries

We describes the basic theory of the encryption and decryption of PMI+ [5] in this section. The basic idea of PMI+ is adding internal perturbation and external perturbation to the central map of MI scheme to resist linearization equation attack and differential attack.

#### 2.1 Notations for PMI+

Let k be a finite field of characteristic two and cardinality q, K be an extension of degree n over k. Let  $\varphi: K \to k^n$  defined by  $\varphi(a_0 + a_1x + ... + a_{n-1}x^{n-1}) = (a_0, a_1, ..., a_{n-1}).$ 

Fix  $\theta$  so that  $\gcd(q^{\theta}+1, q^n-1)=1$  and define  $\tilde{F}: K \to K$  by  $\tilde{F}(X)=X^{1+q^{\theta}}$ . Then F is invertible and  $\tilde{F}^{-1}(X)=X^t$ , where  $t(1+q^{\theta})\equiv 1 \mod (q^n-1)$ .

Define the map  $F': k^n \to k^n$  by  $F'(x_1, ..., x_n) = \varphi \circ \tilde{F} \circ \varphi^{-1}(x_1, ..., x_n)$ .

Fix a small integer r and randomly choose r invertible affine linear functions  $z_1,...,z_r$ , written as  $z_j(x_1,...,x_n)=\sum\limits_{i=1}^n\alpha_{ij}x_i+\beta_j$ , for j=1,...,r. This defines a map  $Z:k^n\to k^r$  by  $Z(x_1,...,x_n)=(z_1,...,z_r)$ . The map Z is source of internal perturbation.

Randomly choose n quadratic polynomials  $\hat{f}_1,...,\hat{f}_n \in k[z_1,...,z_r]$ . The  $\hat{f}_i$  define a map  $\hat{F}: k^r \to k^n$  by  $\hat{F}(z_1,...,z_r) = (\hat{f}_1,...,\hat{f}_n)$ . Let P be the set consisting of the pairs  $(\lambda,\mu)$ , where  $\lambda$  is a point that belongs to the image of  $\hat{F}$  and  $\mu$  is the set of pre-images of  $\lambda$  under  $\hat{F}$ .

Define an internal perturbation map by  $F^*(x_1,...,x_n) = \hat{F} \circ Z(x_1,...,x_n) = (f_1^*,...,f_n^*)$ . Define a map by  $F(x_1,...,x_n) = (F'+F^*)(x_1,...,x_n)$ .

Randomly choose a non-linear equations on  $x_1,...,x_n$  for the central map F as external perturbation. Randomly choose an invertible affine map  $L_1$  in n+a dimensional vector space  $k^{n+a}$ , randomly choose an invertible affine map  $L_2$  in n dimensional vector space  $k^n$ , and  $\bar{F}(x_1,...,x_n)=L_1\circ F\circ L_2(x_1,...,x_n)$  is a public key of PMI+, and the private key includes the central map F', the map  $\hat{F}, Z, L_1^{-1}$  and  $L_2^{-1}$ .

## 2.2 PMI+ Encryption

For a given plaintext block  $(x_1, ..., x_n)$ , when encrypting the plaintext, it only needs to apply the plaintext into the public key polynomial

$$y_1 = \bar{f}_1(x_1, x_2, ..., x_n),$$
...
$$y_{n+a} = \bar{f}_{n+a}(x_1, x_2, ..., x_n),$$
(1)

to calculate the evaluation of n+a quadratic polynomials that a ciphertext  $(y_1,...,y_{n+a})$  can be acquired.

## 2.3 PMI+ Decryption

We can decrypt the ciphertext  $(y_1, ..., y_{n+a})$  by computing

$$X = (x_1, ..., x_n) = L_2^{-1} \circ F^{-1} \circ L_1^{-1}(y_1, ..., y_{n+a}).$$
 (2)

The process is:

- (1) calculating  $Y' = L_1^{-1}(Y) = (y'_1, ..., y'_{n+a});$
- (2) removing a external perturbation polynomials from Y' to obtain  $\bar{Y} = (\bar{y}_1, ..., \bar{y}_n)$ ;
- (3) calculating  $(y_{\lambda 1},...,y_{\lambda n}) = F^{-1}((\bar{y}_1,...,\bar{y}_n) + \lambda)$  for each  $(\lambda,\mu) \in P$ , and checking if  $\mu = Z(y_{\lambda 1},...,y_{\lambda n})$ , if not, continuing this step, otherwise, moving on to the next step;
- (4) applying  $(y_{\lambda 1}, ..., y_{\lambda n})$  into a external perturbation polynomials, if the verification is successful, moving on to the next step, otherwise, returning to the previous step; and
- (5) calculating  $X = L_2^{-1}(y_{\lambda 1}, ..., y_{\lambda n}) = (x_1, ..., x_n)$ , and X is a decrypted plaintext.

## 2.4 Security and Parameter Selection of PMI+

The obtained PMI+ instance can reach a corresponding security level after associated parameters are set. For example, Ding [5] has shown two sets of relatively practical PMI+ parameters in his paper that the security level can be up to over  $2^{80}$ , and the following table shows the two sets of parameters.

| $\overline{n}$ | q | r | a | θ  |
|----------------|---|---|---|----|
| 84             | 2 | 4 | 6 | 14 |
| 136            | 2 | 8 | 6 | 18 |

**Table 1.** Parameters for PMI+

The parameters for PMI+ encryption and decryption hardware implemented in this paper is as the first set of parameters shown in Table 1, and the security level can be up to over  $2^{80}$ .

## 3 Design and Implementation of PMI+ Hardware

#### 3.1 Hardware Structure Design and Algorithm Process

**Design of PMI+ Encryption.** The hardware structure of PMI+ encryption is as shown in Fig. 1.

It can be shown from (1) in Section 2.2, the operation process of PMI+ encryption is equivalent to applying the plaintext into the polynomial to calculate, and its hardware structure is illustrated in Fig. 1.



Fig. 1. The Hardware Structure of PMI+ Encryption

If parameters are selected as  $(n, q, \theta, r, a) = (84, 2, 4, 6, 14)$ , the length of the plaintext block is 84 bits and the length of the ciphertext block is 98 bits, it needs to add 14 external perturbations, the public key is 358,190 (3,655\*98) bits, i.e. 44,774 bytes.

**Design of PMI+ Decryption.** The hardware structure of PMI+ Decryption is shown in Fig. 2.



Fig. 2. The hardware structure of PMI+ Decryption

From Section 2.3, the process of decryption is equivalent to calculating Eq. (2) in Section 2.3. The process of PMI+ decryption is divided into four modules based on the process of calculating Eq. (2): affine transformation, internal perturbator, large power operation and polynomial calculation, as shown in Fig. 2. Wherein, the input of large power operation is a result of the affine transformed result adding the internal perturbator. The role of the polynomial calculation is to verify external perturbator, if the verification is successful, the result will be calculated in the affine transformation module again to obtain the plaintext block, otherwise, to select another element from the internal perturbator for large power operation after addition.

Based on Eq. (2), the process of PMI+ decryption can be abstracted into two parts: affine transformation and decryption mapping. In the parameters we selected, the process is that: firstly the ciphertext is operated by  $L_1^{-1}$  affine transformation, wherein the parameter is 98 bits; then the result of the  $L_1^{-1}$  affine transformation is mapped by decryption mapping algorithm to the plaintext space, and the result is 84 bits; finally the result of the PMI+ decryption mapping is operated by  $L_2^{-1}$  affine transformation, and a 84-bit plaintext block is obtained.

#### 3.2 Basic Arithmetic Unit

Firstly, the basic arithmetic unit throughout the process of PMI+ encryption and decryption is described here.

**Full Parallel Multiplier.** Elements in the finite field K can be expressed by a polynomial as  $a = \sum_{i=0}^{83} a_i x^i$ , where  $a_i \in \{0,1\}$ . And a multiplication over the finite field can be expressed by  $c = a \otimes b \mod R(x) = M \mod R(x)$ .

One large field multiplication can be completed in one clock cycle by an ordinary multiplication algorithm based on standard basis which contains merging similar items and conducting modulus reduction, and the main computation in the algorithm is on modulus reduction. It can be pre-processed with external program. The full parallel multiplier is structured as follows.

$$\begin{split} m_0 &= a_0 \otimes b_0, \\ m_1 &= (a_0 \otimes b_1) \oplus (a_1 \otimes b_0), \\ \dots \\ m_{165} &= (a_{82} \otimes b_{83}) \oplus (a_{83} \otimes b_{82}), \\ m_{166} &= a_{83} \otimes b_{83}; \\ c_0 &= m_0 \oplus m_{84} \oplus \dots \oplus m_{166}, \\ c_1 &= m_1 \oplus m_{85} \oplus \dots \oplus m_{166}, \\ \dots \\ c_{83} &= m_{83} \oplus m_{110} \oplus \dots \oplus m_{165}. \end{split}$$

The full parallel multiplier can complete one multiplication over the finite field K in one clock cycle, which uses 7,056 AND gate circuits and 9,997 XOR gate circuits. It was unrealistic to implement a direct look-up table over finite field  $GF(2^{84})$  (the storage space of the table can be up to  $2^{168}$  bits), in comparison the full parallel multiplier over the finite field K implemented in this paper should be the better.

Full Parallel Vector Dot Product. In the process of PMI+ decryption, the affine transformation is used twice, where in the first time, n is 98, and in the second time, n is 84. It needs to implement two vector dot products: a 98 dimensional vector dot product and a 84 dimensional vector dot product. The scalar value in the vector is 0 or 1, so for scalar value in the vector, the addition uses a XOR gate circuit, and the multiplication uses a AND gate circuit.

Set  $a = (a_0, ..., a_{n-1}), b = (b_0, ..., b_{n-1}),$  where  $a_i, b_i \in \{0, 1\}, i = 0, ..., n-1,$  and the dot product of vectors is  $c \in \{0, 1\} : c = (a_0 \otimes b_0) \oplus ... \oplus (a_{n-1} \otimes b_{n-1}).$ 

The vector dot product operation can be completed in one clock cycle, which uses n AND gate circuits and n-1 XOR gate circuits.

**Full Parallel Squarer.** There is a very useful property in Frobenius mapping that for a map  $T_i(X) = X^{q^i}$  over the finite field K, X is represented as a polynomial basis  $a_0 + a_1 x + ... + a_{83} x^{83}$ , and then the following equation holds:  $T_i(X) = X^{q^i} = a_0 + a_1 x^{q^i} + ... + a_{83} x^{83*q^i}$ .

While in the finite field K, q=2, set  $a=\sum_{i=0}^{83}a_ix^i$  as any element in K, then: $a^2=a_0+a_1x^2+...+a_{83}x^{83*2}$ . It can be pre-processed with external program. The full parallel squarer has the following hardware structure.

$$\begin{aligned} c_0 &= a_0 \oplus a_{42} \oplus \ldots \oplus a_{83}, \\ c_1 &= a_{56} \oplus a_{61} \oplus \ldots \oplus a_{83}, \\ \ldots \\ c_{83} &= a_{55} \oplus a_{60} \oplus \ldots \oplus a_{82}. \end{aligned}$$

The full parallel squarer can complete one squaring operation over the finite field K in one clock cycle, which uses 1,525 XOR gate circuits. Compared with a full parallel multiplier, the full parallel squarer uses about one in twenty logical units, and has a shorter latency, so it seems worthwhile to implement the full parallel squarer.

Full Parallel Power Operator. In order to implement the large power operation efficiently and reuse public arithmetic unit at the most extent, two power operators are implemented, where one is a full parallel power 16 operator and the other is a full parallel power 256 operator. Based on the nature of Frobenius mapping, set a as any element in K, and then:

$$a^{16} = a_0 + a_1 x^{16} + \dots + a_{83} x^{83*16},$$
  
 $a^{256} = a_0 + a_1 x^{256} + \dots + a_{83} x^{83*256}.$ 

It can be pre-processed with external program. The full parallel power 16 operator has the following hardware structure:

$$\begin{split} c_0 &= a_0 \oplus a_9 \oplus \ldots \oplus a_{81}, \\ c_1 &= a_1 \oplus a_7 \oplus \ldots \oplus a_{83}, \\ \ldots \\ c_{83} &= a_8 \oplus a_{10} \oplus \ldots \oplus a_{83}. \end{split}$$

The full parallel power 256 operator has the following hardware structure:

$$d_0 = a_0 \oplus a_4 \oplus \dots \oplus a_{83},$$
  

$$d_1 = a_1 \oplus a_2 \oplus \dots \oplus a_{83},$$
  

$$\dots$$
  

$$d_{83} = a_2 \oplus a_3 \oplus \dots \oplus a_{83}.$$

The full parallel power operator that we implemented can complete one exponentiation over the finite field K in one clock cycle. Compared with a full parallel multiplier, the full parallel squarer uses about one in tenth logical units, and has a shorter latency.

#### 3.3 Implementation of Hardware Core Modules

Implementation of Polynomial Calculation. The calculation of polynomial can be an addition or multiplication over finite field GF(2), which can be implemented by XOR operation and AND operation respectively. The polynomial calculation module is used in both PMI+ encryption and decryption. Wherein, in PMI+ encryption, the input of the polynomial calculation module is a plaintext block of PMI+ and a public key polynomial, the output Y of the polynomial calculation module is a ciphertext block, and the role of the polynomial calculation module is to implement PMI+ encryption; in PMI+ decryption, the input of the polynomial calculation module is a result of the large power operation and a external perturbation polynomials of PMI+, the output of the polynomial calculation module is a result of PMI+ decryption mapping, and the role of the polynomial calculation module is to verify a external perturbation polynomials.

Implementation of Affine Transformation. The affine transformation includes a vector addition and a vector dot product. The vector addition can be implemented by XOR operation directly. The vector dot product can be implemented by the full parallel vector dot product defined by us. In the PMI+decryption, two affine transformations are used, respectively before and after decryption mapping, the first uses a 98 dimensional vector dot product, and the second uses a 84 dimensional vector dot product.

Implementation of Internal Perturbator. When we implement the PMI+ decryption, it needs to abstract a component to complete a transformation for mapping from r=6 dimensional vector to 84 dimensional vector, which is called as internal perturbator. The expression of the map is calculated by a external program off-line, and the arithmetic unit is implemented by 1,078 XOR gates and 627 AND gates.

Implementation of Large Power Operation. In one PMI+ decryption, it needs 64 large power operations at most, so optimized large power operation can improve the performance of the PMI+ decryption hardware at a large extent. If the parameter t is selected as 10240312824970976538687608, it is unrealistic to find the solution of power by multiplication over the finite field K.

Conventional Large Power Operation. The large power operation is implemented by a "square-multiplication" method. The binary equivalent for t is 10000111 10000111 10000111 10000111 10000111 10000111 10000111

10000111 1000,  $X^t$  can be expressed as  $B^t = B^{2^3} \otimes B^{2^4} \otimes ... \otimes B^{2^{83}}$ , so one large power operation can be completed by 83 squaring operations and 40 multiplications.

Optimal Implementation of Large Power Operation. The basic idea to implement large power operation is reusing public arithmetic unit at the most extent, so as to reduce clock cycles of the large power operation. The software implementation of PMI+ has been completed in a 8051 microcontroller by Chen [26] in his master dissertation, where the large power operation uses a similar idea. The differences between the above paper and this paper are that the size of t used in this paper is different (methods for optimization are different), and the implementation of PMI+ in this paper is based on FPGA hardware platform.

We find that fragment S=10000111 appears 10 times in the binary string, so  $X^t$  can be expressed as  $X^t=X^{2^3}\otimes (X^S)^{16}\otimes ((X^S)^{16})^{256}\otimes ...\otimes ((X^S)^{16})...)^{256}$ . In the optimized large power operation,  $X^S=X^{10000111}$  is firstly calculated,

In the optimized large power operation,  $X^S = X^{10000111}$  is firstly calculated, and we implement it for optimization that  $X^S$  can be calculated in 5 cycles. Then, the operation of  $X^t$  can be quickly completed by adding new arithmetic unit, and the rest of the operation can be completed in 11 clock cycles.

## Algorithm 1. Optimal Implementation of Large Power Operation

```
Input: X;
    Output: Y;
    Procedure:
 1 begin
         B2 := \operatorname{square}(X);
 \mathbf{2}
         B4 := \operatorname{square}(B2); Y := \operatorname{multiply}(X, B2);
 3
         B8 := \operatorname{square}(B4); Y := \operatorname{multiply}(Y, B4);
 4
         B128 := power16(B8); B16 := square(B8);
 5
 6
         B135 := \text{multiply}(B128, Y);
 7
         tmp := power16(B135);
         Y := \text{multiply}(B16, \text{tmp}); \text{tmp} = \text{power256(tmp)};
 8
         i = 8;
 9
         while i >= 0 do
10
              Y := \text{multiply}(Y, \text{tmp}); \text{tmp} = \text{power256(tmp)};
11
12
         end
13
         Y := \text{multiply}(Y, \text{tmp});
14
         return Y;
15
16 end
```

Algorithm 1 describes the process of the optimized large power operation. The input X is a 84 dimensional vector, the output Y is also a 84 dimensional vector, the arithmetic units of **square** and **multiply** are a full parallel squarer and a full parallel multiplier respectively, and the arithmetic units of **power16** 

and **power256** are a full parallel power 16 operator and a full parallel power 256 operator respectively.

Our new proposed large power operation can complete a large power operation in 16 clock cycles which are less than one-sixth of those for "square-multiplication" method, and only the logical units taken up by the arithmetic units of **power16** and **power256** increases for its area, so the operational performance is greatly enhanced.

## 4 Experiment Results and Analyses

The algorithm of PMI+ encryption and decryption is implemented in Quartas II 8.0 environment by VHDL with the idea of high speed and parallelization, its hardware simulation is implemented in EP2S130F102014 of the family of StratixII, and the area of PMI+ encryption and decryption hardware is evaluated by SynopsysDC, where the process library is 0.18 nm process library of TSMC and the working voltage is 1.62 volt. The following results come from the real experimental data and compared with current implemented hardware in performance.

#### 4.1 PMI+ Basic Arithmetic Unit

Some basic arithmetic units of PMI+ are implemented in Section 3.2, including a full parallel multiplier, a full parallel vector dot product, a full parallel squarer and a full parallel power operator, and the performance data of these basic arithmetic units is shown in Table 2.

| Arithmetic Units      | Area     | Number of       | Logical Unit | Maximum | Clock  |
|-----------------------|----------|-----------------|--------------|---------|--------|
|                       | $(um^2)$ | Equivalent Gate | (ALUT)       | Latency | Cycles |
| Full Parallel Multi-  | 277997.2 | 27800           | 4823         | 27.000  | 1      |
| plier                 |          |                 |              |         |        |
| Full Parallel Vector  | 3559.25  | 356             | 57           | 23.948  | 1      |
| Dot Product           |          |                 |              |         |        |
| Full Parallel Squarer | 19546    | 1955            | 289          | 17.483  | 1      |
| Full Parallel Power   | 37115.8  | 3712            | 510          | 18.641  | 1      |
| 16 Operator           |          |                 |              |         |        |
| Full Parallel Power   | 39045    | 3905            | 538          | 19.191  | 1      |
| 256 Operator          |          |                 |              |         |        |

Table 2. The Performance of PMI+'s Basic Arithmetic Units

It can be seen from the data in Table 2 that the basic arithmetic unit in PMI+ decryption hardware can complete one basic operation in one cycle, where the full parallel multiplier takes up the maximum area, and compared with the multiplier, the squarer and power operator complete an operation with lower latency while take up less area.

#### 4.2 Large Power Operation in PMI+

A comparison of the number of logical units and the number of clock cycles between two different large power operations is listed in Table 3.

| Arithmetic Units                                          | $Area/(um^2)$ | Number of<br>Equivalent Gate | Logical Unit<br>(ALUT) | Clock<br>Cycles |
|-----------------------------------------------------------|---------------|------------------------------|------------------------|-----------------|
| Implementation Based<br>on "Square - Multiplica-<br>tion" | 334715        | 33472                        | 5176                   | 84              |
| Our Optimal Implementation of Large Power Operation       | 435941        | 43595                        | 6367                   | 16              |

Table 3. Performance Comparison between two Large Power Operation Methods

These results show that the performance of the optimized large power operation has a significant improvement that clock cycles of the optimized large power operation reduce by 80.9% and the area adds about 30.2% for one large power operation. In PMI+ decryption, it needs 64 large power operations at most, so it can save up to 4,416 clock cycle at most for a period of decryption.

## 4.3 PMI+ Encryption and Decryption

We implement the first PMI+ encryption and decryption hardware on FPGA. Compared with other public key encryption and decryption hardware, our hardware implementation of PMI+ possesses of advantages such as small space, fast speed of encryption and decryption, and practical security level.

The whole PMI+ decryption needs at least 207 clock cycles (excepting cycles of reading ROM) to complete a signature operation, and it takes up a total of 11,005 logical units with a area of 680,302  $um^2$ .

Table 5 lists performance data of the PMI+ encryption and decryption hardware. Using experiment data, it's easy to see the number of cycles of the PMI+ decryption is mutable, where 438 cycles for least and 2,915 cycles for most, and the running speed of the PMI+ encryption hardware is far faster than that of the PMI+ decryption hardware and the area of it is far less than the PMI+ decryption hardware.

#### 4.4 Performance Comparison

The performances of the PMI+ decryption hardware is compared with other public key cryptosystem hardware in this section. Table 6 lists the results after comparing the implementation of the PMI+ decryption hardware with other public key cryptosystems.

| Step | Main Arithmetic Units                                             | Clock Cycles |
|------|-------------------------------------------------------------------|--------------|
| 1    | Calculate the Invertible Affine Map Function of $L_1^{-1}$        | 85           |
| 2    | Sum of the Affine Transformed Result and the Enternal Perturbator | 1 - 64       |
| 3    | Large Power Operation                                             | 16 - 1024    |
| 4    | Calculate the Map of $Z$                                          | 6 - 384      |
| 5    | Calculate the Invertible Affine Map Function of $L_2^{-1}$        | 85           |
| 6    | Chook Futus Polymomial                                            | 14 56        |

Table 4. Cycles Required by Arithmetic Units in PMI+ Encryption

Table 5. Performance of our PMI+ Encryption and Decryption

| Hardware | Area     | Number of  | Logical | Clock     | Period | Clock | Total |
|----------|----------|------------|---------|-----------|--------|-------|-------|
| Imple-   | $(um^2)$ | Equivalent | Unit    | Frequency | (ns)   | Cy-   | Time  |
| menta-   |          | Gate       | (ALUT)  | (MHz)     |        | cles  | (us)  |
| tion     |          |            |         |           |        |       |       |
| PMI+     | 160385   | 16039      | 3468    | 145.60    | 6.868  | 497   | 3.42  |
| Encryp-  |          |            |         |           |        |       |       |
| tion     |          |            |         |           |        |       |       |
| PMI+     | 680302   | 68031      | 11005   | 37.04     | 27.000 | 438 - | 11.83 |
| Decryp-  |          |            |         |           |        | 2915  | -     |
| tion     |          |            |         |           |        |       | 78.71 |

Table 6. Performance Comparison among some Public Key Crypto Hardwares

| The Hardware     | Number of  | Clock Cy-  | Frequency | Total time | Area*time |
|------------------|------------|------------|-----------|------------|-----------|
| Implementation   | Equivalent | cles       | (MHz)     | (us)       |           |
| Scheme           | Gate       |            |           |            |           |
| RSA1024-         | 250000     | 357142     | 200       | 1785.71    | 554.70    |
| PSS[12]          |            |            |           |            |           |
| ECC128[9]        | 183000     | 592976     | 204       | 2910       | 661.69    |
| EN-TTS[27]       | 21000      | 60000      | 67        | 895.53     | 23.37     |
| Our Parallelized | 68031      | 438 - 2915 | 37.04     | 11.83 -    | 1 - 6.66  |
| PMI+ Decryp-     |            |            |           | 78.71      |           |
| tion             |            |            |           |            |           |

The data in the table shows that compared with RSA and ECC, parallelization PMI+ decryption hardware that we implemented has a higher performance advantage, such as small product of area and time, and high operating efficiency.

## 5 Conclusion

We design a hardware on FPGAs used to efficiently implement PMI+. It is verified by experiments that our designed hardware can complete an encryption operation within 497 clock cycles, and the clock frequency can be up to 145.6MHz,

and the designed hardware can complete a decryption operation within 438 clock cycles wherein the clock frequency can be up to 37.04MHz. Our main contributions are to develop hardware architecture of encryption and decryption of PMI+ and describe corresponding hardware algorithms. Meanwhile, basic arithmetic units are implemented in this paper with higher efficiency which can complete the operation with lesser latency. Thirdly, an optimized large power operation is implemented which needs only 16 cycles to complete one exponentiation, and compared with general power operation, it can reduce 4288 cycles at most in one process of decryption, with an obvious optimization.

Future studies will include: 1) using registers in hardware more accurately to reduce the area and power consumption of hardware; and 2) reducing the number of logical units of multiplier and latency on the premise that the clock cycles do not increase.

**Acknowledgment.** This work is supported by the National Natural Science Foundation of China under Grant No. U1135004 and 61170080, Guangdong Province Universities and Colleges Pearl River Scholar Funded Scheme (2011), and High-level Talents Project of Guangdong Institutions of Higher Education (2012).

### References

- Balasubramanian, S., Carter, H., Bogdanov, A., Rupp, A., Ding, J.: Fast Multivariate Signature Generation in Hardware: The Case of Rainbow. In: Application-Specific Systems, Architectures and Processors, pp. 25–30 (July 2008)
- Bogdanov, A., Eisenbarth, T., Rupp, A., Wolf, C.: Time-Area Optimized Public-Key Engines: MQ-Cryptosystems as Replacement for Elliptic Curves? In: Oswald, E., Rohatgi, P. (eds.) CHES 2008. LNCS, vol. 5154, pp. 45–61. Springer, Heidelberg (2008)
- Czypek, P., Heyse, S., Thomae, E.: Efficient Implementations of MQPKS on Constrained Devices. In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 374–389. Springer, Heidelberg (2012)
- Ding, J.: A New Variant of the Matsumoto-Imai Cryptosystem through Perturbation. In: Bao, F., Deng, R., Zhou, J. (eds.) PKC 2004. LNCS, vol. 2947, pp. 305–318. Springer, Heidelberg (2004)
- Ding, J., Gower, J.E.: Inoculating Multivariate Schemes Against Differential Attacks. In: Yung, M., Dodis, Y., Kiayias, A., Malkin, T. (eds.) PKC 2006. LNCS, vol. 3958, pp. 290–301. Springer, Heidelberg (2006)
- Ding, J., Schmidt, D., Yin, Z.: Cryptanalysis of the New TTS Scheme in CHES 2004. International Journal of Information Security 5(4), 231–240 (2006)
- 7. Ding, J., Yang, B.Y.: Multivariate Public Key Cryptography. In: Bernstein, D., Buchmann, J., Dahmen, E. (eds.) Post-Quantum Cryptography, pp. 193–241. Springer, Heidelberg (2009)
- 8. Fan, J., Vercauteren, F., Verbauwhede, I.: Efficient Hardware Implementation of Fp-Arithmetic for Pairing-Friendly Curves. IEEE Transactions on Computers 61(5), 676–685 (2012)

- 9. Fan, J., Vercauteren, F., Verbauwhede, I.: Faster Fp-arithmetic for Cryptographic Pairings on Barreto-Naehrig Curves. In: Clavier, C., Gaj, K. (eds.) CHES 2009. LNCS, vol. 5747, pp. 240–253. Springer, Heidelberg (2009)
- Fouque, P.-A., Granboulan, L., Stern, J.: Differential Cryptanalysis for Multivariate Schemes. In: Cramer, R. (ed.) EUROCRYPT 2005. LNCS, vol. 3494, pp. 341–353. Springer, Heidelberg (2005)
- Ghosh, S., Verbauwhede, I.: BLAKE-512 Based 128-bit CCA2 Secure Timing Attack Resistant McEliece Cryptoprocessor. IEEE Transactions on Computers PP(99), 1 (2012)
- Großschädl, J.: High-Speed RSA Hardware Based on Barrets Modular Reduction Method. In: Koç, Ç.K., Paar, C. (eds.) CHES 2000. LNCS, vol. 1965, pp. 191–203. Springer, Heidelberg (2000)
- Hoffstein, J., Pipher, J., Silverman, J.H.: NTRU: A Ring-Based Public Key Cryptosystem. In: Buhler, J.P. (ed.) ANTS 1998. LNCS, vol. 1423, pp. 267–288.
   Springer, Heidelberg (1998)
- 14. Mahdizadeh, H., Masoumi, M.: Novel Architecture for Efficient FPGA Implementation of Elliptic Curve Cryptographic Processor Over  $GF(2^{163})$ . IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21(12), 2330–2333 (2013)
- Matsumoto, T., Imai, H.: Public Quadratic Polynomial-Tuples for Efficient Signature-Verification and Message-Encryption. In: Günther, C.G. (ed.) EUROCRYPT 1988. LNCS, vol. 330, pp. 419–453. Springer, Heidelberg (1988)
- McEliece, R.J.: A Public-Key Cryptosystem Based on Algebraic Coding Theory. DSN Progress Report 42(44), 114–116 (1978)
- Merkle, R.C.: Secrecy, Authentication, and Public Key Systems. Ph.D. thesis, Stanford University (1979)
- Miyamoto, A., Homma, N., Aoki, T., Satoh, A.: Systematic Design of RSA Processors Based on High-Radix Montgomery Multipliers. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19(7), 1136–1146 (2011)
- Rebeiro, C., Roy, S.S., Mukhopadhyay, D.: Pushing the Limits of High-Speed GF(2<sup>m</sup>) Elliptic Curve Scalar Multiplication on FPGAs. In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 494–511. Springer, Heidelberg (2012)
- Shih, J.R., Hu, Y., Hsiao, M.C., Chen, M.S., Shen, W.C., Yang, B.Y., Wu, A.Y., Cheng, C.M.: Securing M2M With Post-Quantum Public-Key Cryptography. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 3(1), 106–116 (2013)
- Shoufan, A., Wink, T., Molter, H., Huss, S., Kohnert, E.: A Novel Cryptoprocessor Architecture for the McEliece Public-Key Cryptosystem. IEEE Transactions on Computers 59(11), 1533–1546 (2010)
- Sutter, G., Deschamps, J., Imana, J.: Modular Multiplication and Exponentiation Architectures for Fast RSA Cryptosystem Based on Digit Serial Computation. IEEE Transactions on Industrial Electronics 58(7), 3101–3109 (2011)
- 23. Sutter, G., Deschamps, J., Imana, J.: Efficient Elliptic Curve Point Multiplication Using Digit-Serial Binary Field Operations. IEEE Transactions on Industrial Electronics 60(1), 217–225 (2013)
- 24. Tang, S., Yi, H., Ding, J., Chen, H., Chen, G.: High-Speed Hardware Implementation of Rainbow Signature on FPGAs. In: Yang, B.-Y. (ed.) PQCrypto 2011. LNCS, vol. 7071, pp. 228–243. Springer, Heidelberg (2011)

- Wang, D., Ding, Y., Zhang, J., Hu, J., Tan, H.: Area-Efficient and Ultra-Low-Power Architecture of RSA Processor for RFID. Electronics Letters 48(19), 1185–1187 (2012)
- Chen, Y.: An Implementation of PMI+ on Low-Cost SmartCard. Master's thesis, National Taiwan University (2006)
- 27. Yang, B.-Y., Cheng, C.-M., Chen, B.-R., Chen, J.-M.: Implementing Minimized Multivariate PKC on Low-Resource Embedded Systems. In: Clark, J.A., Paige, R.F., Polack, F.A.C., Brooke, P.J. (eds.) SPC 2006. LNCS, vol. 3934, pp. 73–88. Springer, Heidelberg (2006)